Missing Data

Neil Lund

2025-01-08

Missing data

Missing data is extremely common in applied research. How should we deal with it?

Missing data

The default here is simply dropping missing data from analyses (listwise deletion), but does this make sense to do?

data |>
  select(-iso2c, -iso3c, -year) |>
  visdat::vis_miss() +
  ggtitle("Missing World Bank Data for 2020")

lm(libdem_score ~ gini, data = data) |>
  summary()

Call:
lm(formula = libdem_score ~ gini, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.56930 -0.16300  0.07276  0.21980  0.39146 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.774470   0.169130   4.579 2.26e-05 ***
gini        -0.006482   0.004760  -1.362    0.178    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2604 on 63 degrees of freedom
  (114 observations deleted due to missingness)
Multiple R-squared:  0.0286,    Adjusted R-squared:  0.01318 
F-statistic: 1.855 on 1 and 63 DF,  p-value: 0.1781

Types of missingness

Our approach to missing data will depend on our theory of why it exists.

  • Missing completely at random (MCAR)
    • no systematic differences between missing and non-missing data
  • Missing at random (MAR)
    • missing is conditional on some characteristics
  • Missing not at random (MNAR)
    • missingness is conditional on something unobserved
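A hedged sketch of the three mechanisms (simulated data, not the World Bank data used elsewhere in these slides) — the underlying data are identical in all three cases; only the missingness rule differs:

```r
set.seed(1)
n <- 1000
wealth <- rnorm(n)                  # always-observed covariate
gini   <- 0.5 * wealth + rnorm(n)   # variable that may go missing

# MCAR: missingness is a coin flip, unrelated to anything
mcar <- rbinom(n, 1, 0.3) == 1
# MAR: missingness depends on the *observed* wealth variable
mar  <- rbinom(n, 1, plogis(wealth)) == 1
# MNAR: missingness depends on the unobserved value of gini itself
mnar <- rbinom(n, 1, plogis(gini)) == 1

# under MCAR the observed mean stays close to the full-sample mean;
# under MNAR it is systematically shifted
c(full = mean(gini), mcar = mean(gini[!mcar]), mnar = mean(gini[!mnar]))
```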

Missing completely at random: examples

MCAR means there is genuinely no “pattern” to the missingness.

  • Only a random subset of countries are studied in each “wave”

  • The power went out during a telephone survey

  • You were distracted while coding and made some errors

Or a dog deleted some of your data

Missing at random

MAR means there is a pattern, but it's based on something you can predict.

  • In our global development data: Gini coefficients are generally not collected for wealthy countries

  • In a study of congressional twitter accounts: older members never sign up for the service

  • In a study of social movement participation: members of vulnerable groups decline to answer a question about protest

All these cases suggest that we can fully account for the characteristic that causes missing data. We have data on wealth in the global development data, we have data on members' ages in our Congress study, and we can identify who belongs to vulnerable groups.

Missing not at random

MNAR means that there is something systematic and unobserved that causes the missingness. This is especially tricky if the missingness depends on expected values of the missing data itself.

  • In the development data: if the World Bank doesn’t collect data on highly unequal countries

  • In a study of protest effectiveness: smaller events rarely generate news coverage

  • In a study of the effect of education on income, people with low income expectations never enter the labor force

We don’t even have partial data on the cause of missingness here: we might suspect that “low potential income” causes someone to decide not to enter the labor force, but we never observe wages for that person in the first place, so this is potentially untestable.

Effects

  • MCAR: No bias. Missingness just adds “noise”

  • MAR: Potentially biased estimates, but also may be fixable

  • MNAR: Potentially biased estimates, but harder to address

Potentially problematic fixes

Listwise deletion: the default

  • Simply dropping observations with missing data can work as long as the data are missing completely at random.

  • Standard errors will be larger than they would be if you had all those observations, but the slopes are unbiased in a linear model

    • However, that unbiasedness result may not carry over to a logit or probit model
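A quick simulated illustration of the listwise-deletion claim (hypothetical data, not the slides' dataset): dropping rows completely at random leaves the OLS slope unbiased, while dropping rows based on the outcome attenuates it.

```r
set.seed(2)
n <- 5000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# MCAR: drop a random 40% of rows -- the slope estimate stays near 2
mcar_keep <- runif(n) > 0.4
coef(lm(y ~ x, subset = mcar_keep))["x"]

# missingness tied to the outcome: drop the lowest 40% of y -- the slope shrinks
y_keep <- y > quantile(y, 0.4)
coef(lm(y ~ x, subset = y_keep))["x"]
```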

Indicator method

Set the missing variable to 0 and include an indicator for missingness

  • Can work under certain circumstances, but potentially very biased even under MCAR conditions
data$gini_missing <- is.na(data$gini)
data$gini_zeroes <- ifelse(is.na(data$gini), 0, data$gini)

lm(libdem_score ~ gini_zeroes + gini_missing, data = data) |>
  summary()

Call:
lm(formula = libdem_score ~ gini_zeroes + gini_missing, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.56930 -0.19196 -0.02777  0.20530  0.52354 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.774470   0.153152   5.057 1.06e-06 ***
gini_zeroes      -0.006482   0.004310  -1.504  0.13438    
gini_missingTRUE -0.462005   0.154736  -2.986  0.00323 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2358 on 176 degrees of freedom
Multiple R-squared:  0.1989,    Adjusted R-squared:  0.1898 
F-statistic: 21.86 on 2 and 176 DF,  p-value: 3.328e-09

Single imputation

We could set missing values to their means, or we could try using a regression model to predict the missing predictors from the other variables.

  • This would produce unbiased estimates under the MCAR or MAR case if we had the right model

  • But it would seriously underestimate the uncertainty in our data because we fail to incorporate the uncertainty from the imputations

imputation_model <- lm(gini ~ gdp_pcap + pop_over_65 + libdem_score, data = data)

data$gini_imputed <- predict(imputation_model, newdata = data)

lm(libdem_score ~ gini_imputed, data = data) |>
  summary()

Call:
lm(formula = libdem_score ~ gini_imputed, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.50675 -0.21558  0.02165  0.21800  0.50440 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.146892   0.164428   6.975 6.71e-11 ***
gini_imputed -0.019556   0.004322  -4.525 1.14e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2464 on 168 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.1086,    Adjusted R-squared:  0.1033 
F-statistic: 20.48 on 1 and 168 DF,  p-value: 1.138e-05

Better Alternatives

Extreme bounds

One approach is to assume nothing and just ask “how bad could it get?”

  • For categorical variables, assume all the missing observations take on their maximum and minimum possible values
  • For continuous variables, there is no real “maximum,” but you could use the maximum observed value, or the mean plus two or three standard deviations.
    • You could also set something narrower as a “plausible” alternative

Extreme bounds

min_gini <- ifelse(is.na(data$gini), min(data$gini, na.rm = TRUE), data$gini)
max_gini <- ifelse(is.na(data$gini), max(data$gini, na.rm = TRUE), data$gini)

mincase  <- lm(libdem_score ~ min_gini, data = data)
observed <- lm(libdem_score ~ gini, data = data)
maxcase  <- lm(libdem_score ~ max_gini, data = data)
           min              observed           max
gini       0.011            -0.006             -0.012
           [0.005, 0.017]   [-0.016, 0.003]    [-0.015, -0.008]
Num.Obs.   179              65                 179
R2 Adj.    0.073            0.013              0.187
BIC        28.3             20.0               4.9

Extreme bounds

  • Advantage: minimal assumptions (especially for bounded DVs, where the extreme values are well defined)

  • Disadvantage: might overstate uncertainty, and the extreme assumptions are simply wrong if the data are actually MCAR

This really works best when you have a very small number of missing observations. If you can show that your results are similar even under the worst case scenario assumptions, then you’re in a pretty good spot.

MICE

Multiple Imputation by Chained Equations

  1. Replace all missing data with a placeholder (like the column mean)

  2. Reset one column (call it X) to its observed values

  3. Use a model to impute X

  4. Replace X with the predicted values from step 3, then move on to the next column.

  5. Repeat for all columns with missing data

  6. Do all of these steps multiple times to create multiple versions of the data

  • MICE can handle complex relationships between missing observations
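A minimal hand-rolled sketch of one chained-equations pass for a single incomplete column (simulated data; real `mice()` draws imputations for every incomplete column and iterates many times):

```r
set.seed(3)
n <- 200
wealth <- rnorm(n)
gini <- 0.6 * wealth + rnorm(n)
gini[rbinom(n, 1, plogis(wealth)) == 1] <- NA   # MAR missingness

miss <- is.na(gini)
# step 1: placeholder (the column mean)
gini_filled <- ifelse(miss, mean(gini, na.rm = TRUE), gini)

# steps 2-4: refit on the observed rows, then impute the missing ones
fit  <- lm(gini_filled ~ wealth, subset = !miss)
pred <- predict(fit, newdata = data.frame(wealth = wealth[miss]))
# adding noise keeps the uncertainty in the imputed values
gini_filled[miss] <- pred + rnorm(sum(miss), 0, sigma(fit))
```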

MICE

  • Do a little cleanup and rescaling (we want to drop the other imputed variables, along with uninformative columns like the country name, and log the skewed population and GDP measures)
library(mice)
mdata <- data |>
  filter(!is.na(libdem_score)) |>
  select(-country, -iso2c, -iso3c, -year,
         -gini_zeroes,
         -gini_imputed,
         -gini_missing
         ) |>
  mutate(population = log(population),
         gdp_pcap = log(gdp_pcap)
         )
mice_data <- mice(mdata, maxit = 35, m = 10, seed = 1,
                  printFlag = FALSE
                  )

MICE

stripplot(mice_data, gini, pch = 19, xlab = "Imputation number")

MICE

Finally, we can fit the data and then pool the results to account for uncertainty in our imputed estimates:

fit <- with(mice_data, lm(libdem_score ~ gini))

summary(pool(fit))
term          estimate   std.error   statistic     df   p.value
(Intercept)    0.605       0.191        3.16     13.6   0.00722
gini          -0.00557     0.00505     -1.10     14.1   0.288

In practice, we would probably want to include more information here. If we have something like older data from a previous analysis, we could substantially improve our predictions and limit the amount of uncertainty we have to deal with.
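For reference, `pool()` combines the per-dataset fits using Rubin's rules; a sketch of the arithmetic for a single coefficient, with hypothetical (made-up) estimates:

```r
# m imputed datasets give m point estimates and squared standard errors
est <- c(-0.0051, -0.0060, -0.0048, -0.0062, -0.0055)   # hypothetical values
se2 <- c(0.0049, 0.0052, 0.0047, 0.0051, 0.0050)^2

m    <- length(est)
qbar <- mean(est)                    # pooled point estimate
ubar <- mean(se2)                    # average within-imputation variance
b    <- var(est)                     # between-imputation variance
total_var <- ubar + (1 + 1 / m) * b  # Rubin's total variance
sqrt(total_var)                      # pooled standard error
```

The pooled standard error is always at least as large as the average within-imputation standard error; the between-imputation term is exactly the uncertainty that single imputation throws away.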

MICE

  • MICE only works for MCAR or MAR cases. In other words: if we can’t create a regression model that can “explain” the missingness, then it doesn’t solve anything

  • No free lunch: we need at least some data about each row.

  • More missing data = more uncertainty.

Selection problem: truncation of Y

What if people with a certain value of the outcome are never observed? For example: people with low expected earnings may drop out of the labor force entirely. This would cause an underestimate of the impact of education on wages because the lowest earners are never actually part of the equation. Moreover, imputation for these cases would be pretty dubious: we’re already trying to create a regression model to predict the outcome!

Selection problem: truncation of Y

set.seed(500)
N <- 1000
educ_year <- rpois(N, 15)
earnings  <- educ_year * 10 + rnorm(N, 0, 50)
lfp <- ifelse(earnings < quantile(earnings, .20), 0, 1)
lfp <- factor(lfp, labels = c('not in labor force', 'in labor force'))
df <- data.frame(educ_year, earnings, lfp)
df_comb <- bind_rows(
  df |> mutate(group = "true model"),
  df |> filter(lfp == "in labor force") |> mutate(group = 'observed only')
)

true_model <- lm(earnings ~ educ_year, data = df)
observed_model <- lm(earnings ~ educ_year, data = df[which(lfp == "in labor force"), ])

ggplot(df_comb, aes(x = educ_year, y = earnings)) +
  geom_point(aes(fill = lfp), alpha = .5, shape = 21, color = 'black') +
  geom_smooth(method = 'lm', col = 'orange', se = FALSE) +
  facet_wrap(~group) +
  xlab("education") +
  ylab('earnings') +
  ggtitle("hypothetical wage and education results") +
  theme_minimal() +
  scale_fill_manual(values = c("white", "black"))

Selection problem: Heckman model

  • Stage 1: Make a model to predict observation
  • Stage 2: Include a measure of inverse selection probability as a covariate

Selection problem: Heckman model

The inverse Mills ratio is a transformation of the predicted selection probabilities: the normal density over the normal CDF, evaluated at the first-stage linear predictor. IMR values will be higher for observations that more closely resemble the missing observations.

# predict the missingness
stage_1 <- glm(lfp == "in labor force" ~ educ_year,
               data = df,
               family = binomial(link = 'probit'))
# the inverse Mills ratio of the selection probability
first_stage_lp <- predict(stage_1)
df$imr <- dnorm(first_stage_lp) / pnorm(first_stage_lp)
# the second stage
stage_2 <- lm(earnings ~ 1 + educ_year + imr,
              data = df[which(lfp == "in labor force"), ])

Selection problem: Heckman model

The corrected coefficients now resemble the real model (although we still need to correct the standard errors)

modelsummary(list("actual model"= true_model, 
       "complete case model" = observed_model, 
       "heckman corrected model" = stage_2))
             actual model   complete case model   heckman corrected model
(Intercept)         9.267                68.450                    29.252
                  (6.063)               (5.962)                  (21.542)
educ_year           9.453                 6.594                     8.489
                  (0.390)               (0.367)                   (1.066)
imr                                                               32.313
                                                                 (17.067)
Num.Obs.             1000                   800                       800
R2                  0.370                 0.288                     0.291
R2 Adj.             0.370                 0.287                     0.289
AIC               10564.0                8092.0                    8090.4
BIC               10578.7                8106.1                    8109.2
Log.Lik.        -5279.006             -4043.012                 -4041.217
F                 586.460               322.149                   163.388
RMSE                47.47                 37.90                     37.81

Selection problem: Heckman model

The sampleSelection package can estimate this in a single step, and will automatically produce corrected standard errors.

library(sampleSelection)
model <- selection(
  # stage 1 model
  lfp == "in labor force" ~ educ_year,
  # stage 2 model
  earnings ~ educ_year,
  method = '2step', data = df
)
# showing only the coefficients and SEs in a formatted table
coef(summary(model), part = 'outcome') |>
  huxtable() |>
  add_rownames(colname = 'variable') |>
  add_colnames("")
variable        Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)        29.3         22.5      1.30     0.195
educ_year          8.49         1.12      7.57     8.2e-14
invMillsRatio      32.3         17.5      1.84     0.0657

Major takeaways

  • For MCAR (truly missing at random):

    • Normal listwise deletion is fine! The real question is “how do you know this is the situation?”

    • Descriptive statistics can help: MCAR data should be roughly balanced on observed characteristics.

  • For MAR (missing conditional on observed data):

    • Normal listwise deletion still works provided you can condition on the thing that causes data to be missing.

    • Imputation is probably harmless for these cases
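One way to probe the “how do you know?” question is to compare observed characteristics across missing and non-missing rows; a hedged sketch on simulated data (a more formal option is Little's MCAR test, e.g. `naniar::mcar_test()`):

```r
set.seed(4)
df_check <- data.frame(gdp = rnorm(200), gini = rnorm(200))
# here missingness in gini depends on gdp, so the data are MAR, not MCAR
df_check$gini[rbinom(200, 1, plogis(df_check$gdp)) == 1] <- NA

# under MCAR, gdp should look about the same whether gini is missing or not
pval <- t.test(gdp ~ is.na(gini), data = df_check)$p.value
pval   # a small p-value is evidence against MCAR
```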

Major Takeaways

  • For MNAR (missing conditional on unobserved variables):

    • Missing predictors: impute missing observations.

    • Missing outcomes: use a two stage model

  • For all missing data: less is better. If you can show the problem is minimal even under extreme assumptions, then there’s a lot less reason for concern.